Diabetes Prediction Using Machine Learning¶
Overview¶
This project focuses on building a machine learning model to predict the likelihood of an individual being diabetic, pre-diabetic, or healthy. By analyzing healthcare statistics and lifestyle factors, the project aims to assist in early detection and intervention, enabling better diabetes management and prevention strategies.
Project Goals¶
- Understand how healthcare and lifestyle factors relate to diabetes risk.
- Build a reliable classification model using advanced machine learning techniques.
- Provide actionable insights through feature analysis and evaluation metrics.
Features¶
- Data Preprocessing: Handling missing values, outliers, class imbalances, and encoding categorical variables.
- Feature Selection: Identifying key factors influencing diabetes risk using correlation analysis and feature importance algorithms.
- Model Development: Implementing and evaluating various machine learning models (e.g., Logistic Regression, Random Forest, Gradient Boosting, SVM).
- Evaluation Metrics: Assessing models using precision, recall, F1-score, accuracy, and AUC for robust validation.
- Presentation & Reporting: Summarizing the results, insights, and recommendations in an accessible format.
Methodology¶
- Data Preparation:
- Collect and preprocess healthcare and lifestyle data.
- Resolve discrepancies such as missing values, outliers, and imbalances.
- Feature Selection & Model Building:
- Identify significant predictors of diabetes.
- Compare machine learning algorithms to finalize the best-performing model.
- Model Evaluation:
- Validate the model using multiple performance metrics.
- Ensure robustness through cross-validation techniques.
- Documentation & Deployment:
- Prepare detailed documentation and presentations.
- Finalize the project for real-world applications.
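The cross-validation step in the methodology can be sketched with scikit-learn. This is a minimal illustration, not the project's final pipeline; the synthetic data stands in for the preprocessed diabetes features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic placeholder data standing in for the preprocessed diabetes features
X_demo, y_demo = make_classification(n_samples=200, n_features=16, random_state=42)

# Stratified 5-fold CV keeps the Positive/Negative ratio constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42),
                         X_demo, y_demo, cv=cv, scoring="f1")
print(scores.mean())
```

Averaging the per-fold F1 scores gives a more robust estimate than a single train/test split.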
Technologies Used¶
- Programming Language: Python
- Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, XGBoost
- Tools: Jupyter Notebook, GitHub
Expected Outcomes¶
- A machine learning model that accurately predicts diabetes risk.
- Insights into the impact of lifestyle factors on diabetes.
- A comprehensive framework for healthcare professionals to support early diagnosis and preventative care.
Importing Libraries¶
- Pandas : Data manipulation and analysis.
- Matplotlib : Basic data visualization.
- Seaborn : Statistical visualization (color palettes and heatmaps).
- Scikit-learn : Machine learning and preprocessing.
- Plotly Express : Interactive data visualization.
# Importing the packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
Diabetes = pd.read_csv('diabetesInfosys.csv')  # Loading the dataset
Diabetes.head(10)  # Display the top 10 records of the dataset
| | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | Male | No | Yes | No | Yes | No | No | No | Yes | No | Yes | No | Yes | Yes | Yes | Positive |
| 1 | 58 | Male | No | No | No | Yes | No | No | Yes | No | No | No | Yes | No | Yes | No | Positive |
| 2 | 41 | Male | Yes | No | No | Yes | Yes | No | No | Yes | No | Yes | No | Yes | Yes | No | Positive |
| 3 | 45 | Male | No | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | Positive |
| 4 | 60 | Male | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Positive |
| 5 | 55 | Male | Yes | Yes | No | Yes | Yes | No | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Positive |
| 6 | 57 | Male | Yes | Yes | No | Yes | Yes | Yes | No | No | No | Yes | Yes | No | No | No | Positive |
| 7 | 66 | Male | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | No | Yes | Yes | No | No | Positive |
| 8 | 67 | Male | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | No | Yes | Positive |
| 9 | 70 | Male | No | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | No | No | Yes | No | Positive |
Preparing the Dataset¶
Checking for missing/null values.
Examining the information in the columns.
The fundamental statistics of the numeric column.
Diabetes.isnull().sum()
Age 0 Gender 0 Polyuria 0 Polydipsia 0 sudden weight loss 0 weakness 0 Polyphagia 0 Genital thrush 0 visual blurring 0 Itching 0 Irritability 0 delayed healing 0 partial paresis 0 muscle stiffness 0 Alopecia 0 Obesity 0 class 0 dtype: int64
Diabetes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 17 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Age | 520 non-null | int64 |
| 1 | Gender | 520 non-null | object |
| 2 | Polyuria | 520 non-null | object |
| 3 | Polydipsia | 520 non-null | object |
| 4 | sudden weight loss | 520 non-null | object |
| 5 | weakness | 520 non-null | object |
| 6 | Polyphagia | 520 non-null | object |
| 7 | Genital thrush | 520 non-null | object |
| 8 | visual blurring | 520 non-null | object |
| 9 | Itching | 520 non-null | object |
| 10 | Irritability | 520 non-null | object |
| 11 | delayed healing | 520 non-null | object |
| 12 | partial paresis | 520 non-null | object |
| 13 | muscle stiffness | 520 non-null | object |
| 14 | Alopecia | 520 non-null | object |
| 15 | Obesity | 520 non-null | object |
| 16 | class | 520 non-null | object |

dtypes: int64(1), object(16); memory usage: 69.2+ KB
Diabetes.describe()
| | Age |
|---|---|
| count | 520.000000 |
| mean | 48.028846 |
| std | 12.151466 |
| min | 16.000000 |
| 25% | 39.000000 |
| 50% | 47.500000 |
| 75% | 57.000000 |
| max | 90.000000 |
EDA¶
This Exploratory Data Analysis (EDA) step focuses on preparing data for modeling by addressing:
Duplicates : Eliminate duplicates to maintain data uniqueness.
Missing Values : Identify and impute or remove based on feature relevance.
Outliers : Detect and manage with Z-score or IQR to avoid model bias.
Data Consistency : Standardize data types for reliable model compatibility.
This EDA phase ensures data quality and readiness for accurate modeling.
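The duplicate and outlier handling described above can be sketched as follows. A minimal illustration on a tiny stand-in frame (the values and the 1.5×IQR rule here are illustrative, not the project's final choices):

```python
import pandas as pd

# Tiny stand-in frame; Age is the dataset's only numeric column
df = pd.DataFrame({"Age": [40, 58, 41, 41, 200],
                   "class": ["Positive"] * 5})

# Duplicates: drop exact repeats to keep each record unique
df = df.drop_duplicates()

# Outliers: keep Age values within 1.5 * IQR of the quartiles
q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["Age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[mask]
print(clean)
```

Here the duplicated row and the implausible Age of 200 are both removed before modeling.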
import matplotlib.pyplot as plt
# Count the occurrences of each class (positive/negative)
class_counts = Diabetes['class'].value_counts()
# Custom colors for the pie chart
colors = ['#1f77b4', '#ff7f0e'] # Blue and Orange
# Create the pie chart
plt.figure(figsize=(6, 6))
plt.pie(class_counts, labels=class_counts.index, autopct='%1.1f%%', startangle=140, colors=colors)
plt.title("Ratio of Positive and Negative Cases")
plt.show()
# For Creating Interactive graphs
gendis= px.histogram(Diabetes, x = 'Gender', color = 'class', title="Distribution of Positive vs. Negative Diabetes Cases by Gender")
gendis.show()
pltbl= ['Gender', 'class']
cm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[pltbl[0]],Diabetes[pltbl[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = cm)
| class | Negative | Positive |
|---|---|---|
| Gender | ||
| Female | 9.500000 | 54.060000 |
| Male | 90.500000 | 45.940000 |
Female patients make up 54.06% of positive cases but only 9.5% of negative cases, so women in this dataset are disproportionately likely to test positive. This imbalance is worth keeping in mind when interpreting gender as a predictor.
polyuria=px.histogram(Diabetes, x = 'Polyuria', color = 'class', title="Polyuria Frequency by Diabetes Status",
labels={"Polyuria": "Polyuria (Frequent Urination)", "count": "Number of Cases", "class": "Diabetes Status"})
polyuria.show()
plttbl_polyuria= ['Polyuria', 'class']
cm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plttbl_polyuria[0]], Diabetes[plttbl_polyuria[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = cm)
| class | Negative | Positive |
|---|---|---|
| Polyuria | ||
| No | 92.500000 | 24.060000 |
| Yes | 7.500000 | 75.940000 |
Among positive cases, 75.94% report polyuria (frequent urination), compared with only 7.5% of negative cases, making polyuria one of the strongest single indicators in this dataset.
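A caution when reading these tables: `normalize='columns'` gives the symptom rate within each class, not the chance of diabetes given the symptom. To estimate P(Positive | symptom), normalize by rows instead. A small sketch with toy data mimicking the dataset's columns:

```python
import pandas as pd

# Toy data with the same column names as the dataset
toy = pd.DataFrame({
    "Polyuria": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "class": ["Positive", "Positive", "Negative", "Negative", "Positive", "Positive"],
})

# normalize='index': each row sums to 100%, giving P(class | Polyuria)
by_symptom = pd.crosstab(toy["Polyuria"], toy["class"], normalize="index") * 100
print(by_symptom.round(2))
```

In this toy frame every patient with polyuria is positive (100%), while a third of those without it are positive; the real dataset's row-normalized numbers would differ from the column-normalized tables shown here.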
polydispia = px.histogram(Diabetes, x = 'Polydipsia', color = 'class', title="Frequency of Increased Water Consumption (Polydipsia) by Diabetes Status",
labels={"Polydipsia": "Polydipsia (Increased Water Consumption)", "count": "Number of Cases", "class": "Diabetes Status"})
polydispia.show()
plttblpolydispia= ['Polydipsia', 'class']
rm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plttblpolydispia[0]], Diabetes[plttblpolydispia[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = rm)
| class | Negative | Positive |
|---|---|---|
| Polydipsia | ||
| No | 96.000000 | 29.690000 |
| Yes | 4.000000 | 70.310000 |
Polydipsia (excessive thirst) shows a similar pattern: 70.31% of positive cases report it, versus only 4% of negative cases.
swl = px.histogram(Diabetes, x = 'sudden weight loss', color = 'class', title="Distribution of Sudden Weight Loss by Diabetes Status",
labels={"sudden weight loss": "Sudden Weight Loss", "count": "Number of Cases", "class": "Diabetes Status"})
swl.show()
plttblswl= ['sudden weight loss', 'class']
qm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plttblswl[0]], Diabetes[plttblswl[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = qm)
| class | Negative | Positive |
|---|---|---|
| sudden weight loss | ||
| No | 85.500000 | 41.250000 |
| Yes | 14.500000 | 58.750000 |
Sudden weight loss appears in 58.75% of positive cases but only 14.5% of negative cases. It is an important indicator, though less discriminating than polyuria or polydipsia, and many other common illnesses can also cause unexpected weight loss, so it is not a definitive sign of diabetes on its own.
weakness_fig = px.histogram(Diabetes, x = 'weakness', color = 'class', title="Distribution of Weakness by Diabetes Status",
                            labels={"weakness": "Weakness", "count": "Number of Cases", "class": "Diabetes Status"})
weakness_fig.show()
wkns = ['weakness', 'class']
sm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[wkns [0]],Diabetes[wkns [1]], normalize='columns') * 100,2)).style.background_gradient(cmap = sm)
| class | Negative | Positive |
|---|---|---|
| weakness | ||
| No | 56.500000 | 31.870000 |
| Yes | 43.500000 | 68.120000 |
Weakness is reported by 68.12% of positive cases, versus 43.5% of negative cases, a noticeable but weaker association than the classic symptoms above.
eating = px.histogram(Diabetes, x = 'Polyphagia', color = 'class', title="Distribution of Polyphagia (Excessive Eating) by Diabetes Status",
labels={"Polyphagia": "Polyphagia (Excessive Eating)", "count": "Number of Cases", "class": "Diabetes Status"})
eating.show()
plt_eating= ['Polyphagia', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_eating[0]], Diabetes[plt_eating[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
| class | Negative | Positive |
|---|---|---|
| Polyphagia | ||
| No | 76.000000 | 40.940000 |
| Yes | 24.000000 | 59.060000 |
Polyphagia (excessive eating) is reported by 59.06% of positive cases but only 24% of negative cases, a moderate positive association.
gntlthrsh = px.histogram(Diabetes, x = 'Genital thrush',color='class',title="Genital Thrush Distribution by Diabetes Status",
labels={"Genital thrush": "Genital Thrush", "count": "Number of Cases", "class": "Diabetes Status"})
gntlthrsh.show()
plt_thrsh= ['Genital thrush', 'class']
um = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_thrsh[0]], Diabetes[plt_thrsh[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = um)
| class | Negative | Positive |
|---|---|---|
| Genital thrush | ||
| No | 83.500000 | 74.060000 |
| Yes | 16.500000 | 25.940000 |
Genital thrush is relatively uncommon in both groups: 25.94% of positive cases report it, versus 16.5% of negative cases, so it is only a weak indicator on its own.
visual = px.histogram(Diabetes, x = 'visual blurring', color = 'class', title="Visual Blurring Distribution by Diabetes Status",
labels={"visual blurring": "Visual Blurring", "count": "Number of Cases", "class": "Diabetes Status"})
visual.show()
plt_blurring= ['visual blurring', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_blurring[0]], Diabetes[plt_blurring[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
| class | Negative | Positive |
|---|---|---|
| visual blurring | ||
| No | 71.000000 | 45.310000 |
| Yes | 29.000000 | 54.690000 |
Visual blurring is reported by 54.69% of positive cases, versus 29% of negative cases, a moderate positive association.
creeping = px.histogram(Diabetes, x = 'Itching', color = 'class', title="Distribution of Itching (Creeping) Symptom by Diabetes Status",
labels={"Itching": "Itching (Creeping)", "count": "Number of Cases", "class": "Diabetes Status"})
creeping.show()
plt_creeping= ['Itching', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_creeping[0]], Diabetes[plt_creeping[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
| class | Negative | Positive |
|---|---|---|
| Itching | ||
| No | 50.500000 | 51.880000 |
| Yes | 49.500000 | 48.120000 |
Itching occurs at almost the same rate in both groups (48.12% of positive cases versus 49.5% of negative cases), so it carries little predictive signal on its own.
irritiability = px.histogram(Diabetes, x = 'Irritability', color = 'class', title="Distribution of Irritability Symptom by Diabetes Status",
labels={"Irritability": "Irritability", "count": "Number of Cases", "class": "Diabetes Status"})
irritiability.show()
plt_irritiability= ['Irritability', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_irritiability[0]], Diabetes[plt_irritiability[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
| class | Negative | Positive |
|---|---|---|
| Irritability | ||
| No | 92.000000 | 65.620000 |
| Yes | 8.000000 | 34.380000 |
Irritability is reported by 34.38% of positive cases but only 8% of negative cases. Although a minority symptom overall, it is noticeably more common among positive cases, so it is positively associated with diabetes in this dataset.
dh = px.histogram(Diabetes, x = 'delayed healing', color = 'class', title="Delayed Healing by Diabetes Status",
                  labels={"delayed healing": "Delayed Healing", "count": "Number of Cases", "class": "Diabetes Status"})
dh.show()
plt_dh= ['delayed healing', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_dh[0]], Diabetes[plt_dh[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
| class | Negative | Positive |
|---|---|---|
| delayed healing | ||
| No | 57.000000 | 52.190000 |
| Yes | 43.000000 | 47.810000 |
Delayed healing occurs at similar rates in both groups (47.81% of positive cases versus 43% of negative cases), so it adds little predictive signal on its own.
paresis = px.histogram(Diabetes, x = 'partial paresis', color = 'class', title="Partial Paresis by Diabetes Status")
paresis.show()
plt_paresis= ['partial paresis', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_paresis[0]], Diabetes[plt_paresis[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
| class | Negative | Positive |
|---|---|---|
| partial paresis | ||
| No | 84.000000 | 40.000000 |
| Yes | 16.000000 | 60.000000 |
Partial paresis is reported by 60% of positive cases, versus only 16% of negative cases, making it one of the stronger indicators in this dataset.
muscle_stiffness = px.histogram(Diabetes, x = 'muscle stiffness', color = 'class', title="Muscle Stiffness by Diabetes Status")
muscle_stiffness.show()
plt_stiffness= ['muscle stiffness', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_stiffness[0]], Diabetes[plt_stiffness[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
| class | Negative | Positive |
|---|---|---|
| muscle stiffness | ||
| No | 70.000000 | 57.810000 |
| Yes | 30.000000 | 42.190000 |
Muscle stiffness is reported by 42.19% of positive cases, versus 30% of negative cases, a mild positive association.
Hair_loss = px.histogram(Diabetes, x = 'Alopecia', color = 'class', title="Alopecia (Hair Loss) by Diabetes Status")
Hair_loss.show()
plt_Hair_loss= ['Alopecia', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_Hair_loss[0]], Diabetes[plt_Hair_loss[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
| class | Negative | Positive |
|---|---|---|
| Alopecia | ||
| No | 49.500000 | 75.620000 |
| Yes | 50.500000 | 24.380000 |
Alopecia shows the opposite pattern: only 24.38% of positive cases report it, versus 50.5% of negative cases, so it is negatively associated with a positive diagnosis in this dataset.
Obesity = px.histogram(Diabetes, x = 'Obesity', color = 'class', title="Obesity by Diabetes Status")
Obesity.show()
plt_body_fat= ['Obesity', 'class']
tm = sns.light_palette("green", as_cmap=True)
(round(pd.crosstab(Diabetes[plt_body_fat[0]], Diabetes[plt_body_fat[1]], normalize='columns') * 100,2)).style.background_gradient(cmap = tm)
| class | Negative | Positive |
|---|---|---|
| Obesity | ||
| No | 86.500000 | 80.940000 |
| Yes | 13.500000 | 19.060000 |
Obesity is a surprisingly weak signal here: 19.06% of positive cases report it, versus 13.5% of negative cases, so it adds little discriminative power in this dataset despite its known clinical relevance.
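The per-symptom crosstabs above all follow one pattern and can be generated in a single pass. A compact sketch, shown with a small stand-in frame in place of the loaded `Diabetes` DataFrame:

```python
import pandas as pd

# Stand-in frame; in the notebook this would be the loaded `Diabetes` DataFrame
df = pd.DataFrame({
    "Polyuria": ["Yes", "No", "Yes", "No"],
    "Obesity":  ["No", "No", "Yes", "No"],
    "class":    ["Positive", "Negative", "Positive", "Positive"],
})

symptoms = ["Polyuria", "Obesity"]
# Share of each class reporting the symptom, one row per symptom
summary = pd.DataFrame({
    s: pd.crosstab(df[s], df["class"], normalize="columns").loc["Yes"] * 100
    for s in symptoms
}).T
print(summary.round(2))
```

One summary table like this makes it easy to compare symptom rates across classes without repeating the crosstab cell for every feature.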
Label Encoding¶
from sklearn import preprocessing
from sklearn import model_selection
number = preprocessing.LabelEncoder()
dtacpy1 = Diabetes.copy() # Duplicating the Dataset
dtacpy1.head(5)
| | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | Male | No | Yes | No | Yes | No | No | No | Yes | No | Yes | No | Yes | Yes | Yes | Positive |
| 1 | 58 | Male | No | No | No | Yes | No | No | Yes | No | No | No | Yes | No | Yes | No | Positive |
| 2 | 41 | Male | Yes | No | No | Yes | Yes | No | No | Yes | No | Yes | No | Yes | Yes | No | Positive |
| 3 | 45 | Male | No | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | Positive |
| 4 | 60 | Male | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Positive |
# Label-encode every column; note this also rank-encodes the numeric 'Age'
# column (order-preserving), which is why the Age values change below.
for i in dtacpy1.columns:
    dtacpy1[i] = number.fit_transform(dtacpy1[i])
dtacpy1.head()
| | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 34 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 17 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 3 | 21 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 36 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
X = dtacpy1.drop(['class'],axis=1) # Independent
y= dtacpy1['class'] # Dependent
X.head()
| | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 34 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 17 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 21 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 36 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
y.head()
0 1 1 1 2 1 3 1 4 1 Name: class, dtype: int32
# Calculate the correlation of each feature with the target variable
correlation = X.corrwith(y)
# Print the correlation values for reference
print("Feature Correlations with Target Variable:\n", correlation)
# Enhanced Bar Plot for Correlation with custom color
plt.figure(figsize=(15, 5))
correlation.plot(
kind="bar",
color="coral", # Change bar color to coral
edgecolor="darkred",
linewidth=1,
title="Feature Correlation with Target Variable (Class)"
)
# Add grid and adjust plot aesthetics
plt.title("Correlation of Features with Target Variable", fontsize=16, fontweight='bold')
plt.xlabel("Features", fontsize=12)
plt.ylabel("Correlation Coefficient", fontsize=12)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
# Display the plot
plt.show()
Feature Correlations with Target Variable:

| Feature | Correlation |
|---|---|
| Age | 0.106419 |
| Gender | -0.449233 |
| Polyuria | 0.665922 |
| Polydipsia | 0.648734 |
| sudden weight loss | 0.436568 |
| weakness | 0.243275 |
| Polyphagia | 0.342504 |
| Genital thrush | 0.110288 |
| visual blurring | 0.251300 |
| Itching | -0.013384 |
| Irritability | 0.299467 |
| delayed healing | 0.046980 |
| partial paresis | 0.432288 |
| muscle stiffness | 0.122474 |
| Alopecia | -0.267512 |
| Obesity | 0.072173 |
From the graph above, we can identify a strong correlation between the variable "Class" (indicating diabetes presence) and specific factors, listed in order of strongest positive relationship:
- Polyuria (frequent urination)
- Polydipsia (increased thirst)
- Sudden weight loss
- Partial paresis (muscle weakness)
These factors are positively correlated with the likelihood of diabetes, meaning patients showing these symptoms are more likely to be diagnosed as diabetic. This insight is key for identifying individuals at higher risk based on common symptoms.
On the other hand, variables that show a negative correlation, such as Alopecia (hair loss), point the other way: patients with alopecia are somewhat less likely to be diabetic in this dataset. A negative correlation of this size does not make alopecia a reliable rule-out sign; it is simply not a meaningful indicator of diabetes risk in isolation.
symptoms = ["Polyuria", "Polydipsia", "sudden weight loss", "weakness", "Polyphagia",
"Genital thrush", "visual blurring", "Itching", "Irritability",
"delayed healing", "partial paresis", "muscle stiffness", "Alopecia", "Obesity"]
df_binary = pd.get_dummies(Diabetes[symptoms], drop_first=True)
df_binary['Target'] = Diabetes['class'].apply(lambda x: 1 if x == "Positive" else 0)
# Calculate pairwise correlations
corr_matrix_binary = df_binary.corr()
# Plotting heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix_binary, cmap="PiYG", annot=True, linewidths=0.5, center=0)
plt.title("Pairwise Correlation Heatmap for Features and Target", fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
The pairwise correlation heatmap for binary features provides the following insights about the relationships between symptoms and diabetes:
Direct Symptom-Diabetes Correlation :
- The correlation values in the "Target" row show how strongly each symptom is associated with a diabetes diagnosis (positive correlation) or with the absence of diabetes (negative correlation).
- Positive Correlations (values closer to +1): Symptoms with higher positive correlations are more commonly present in individuals diagnosed with diabetes. For instance, if symptoms like Polyuria or Polydipsia have high positive correlations, this indicates these symptoms are strong indicators of diabetes.
- Negative Correlations (values closer to -1): Symptoms with negative correlations may be more frequent in individuals without diabetes. For instance, if Alopecia shows a negative correlation, it could indicate that individuals with alopecia are less likely to be diagnosed with diabetes.
Inter-Symptom Relationships : Symptoms with high correlations to each other may indicate a tendency to co-occur. For example, if Polyuria and Polydipsia show a strong correlation with each other, it suggests these symptoms often appear together in diabetic patients, possibly due to similar physiological effects.
Weak or Neutral Correlations : Features with correlation values near zero with the target variable may not contribute much to diabetes prediction and could be less useful in diagnostic contexts. These features might represent common symptoms that don't have a strong association with diabetes specifically, such as symptoms more related to other health issues.
Potential Predictive Indicators : The symptoms with the strongest positive or negative correlations with the target variable are the most useful for diagnosis and model prediction. Positive indicators (e.g., symptoms highly correlated with diabetes) could become focus points for early screening.
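The "Target" row of the heatmap can also be pulled out and ranked directly. A small sketch with a stand-in for `df_binary` (on the real frame built earlier, the same pattern would be `df_binary.corr()["Target"]`):

```python
import pandas as pd

# Stand-in for df_binary: dummy-encoded symptoms plus a 0/1 Target
df_binary = pd.DataFrame({
    "Polyuria_Yes": [1, 1, 0, 0, 1, 0],
    "Alopecia_Yes": [0, 0, 1, 1, 0, 1],
    "Target":       [1, 1, 0, 0, 1, 0],
})

# Rank symptoms by their correlation with the diagnosis
target_corr = df_binary.corr()["Target"].drop("Target").sort_values(ascending=False)
print(target_corr)
```

Sorting this Series surfaces the strongest positive and negative indicators at a glance, without scanning the full heatmap.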
# Enhanced box plot with all dataset features in tooltips
genbox = px.box(
Diabetes,
y="Age",
x="class",
color="Gender",
points="all",
title="Age Distribution by Diabetes Status, Gender, and Additional Symptoms",
# Custom color mapping for gender
color_discrete_map={"Male": "blue", "Female": "pink"},
# Adding facets for additional segmentation (e.g., by "sudden weight loss")
facet_row="Polyuria", # Faceting by Polyuria (could change based on interest)
facet_col="Polydipsia", # Faceting by Polydipsia
# Including all relevant attributes as hover data for insight
hover_data={
"Polyuria": True,
"Polydipsia": True,
"sudden weight loss": True,
"weakness": True,
"Polyphagia": True,
"Genital thrush": True,
"visual blurring": True,
"Itching": True,
"Irritability": True,
"partial paresis": True,
"Alopecia": True,
"class": True
}
)
# Show the enhanced plot
genbox.show()
- The box plot shows how age and gender interact with diabetes status, and the faceted panels reveal distinct patterns for different symptom combinations.
- Symptoms like frequent urination (Polyuria) and excessive thirst (Polydipsia) are commonly seen in diabetes-positive cases, while hair loss (Alopecia) is less common among them.
- This plot helps us identify typical diabetes symptoms and points to specific combinations of age, gender, and symptoms that may assist in early detection of diabetes.
Feature Selection¶
- Feature selection is the process of identifying and selecting the most important features in a dataset. It aims to improve model performance by removing irrelevant or redundant features. This helps reduce overfitting, improve accuracy, and decrease computational cost.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
# Perform Chi-Square feature selection
chi2_selector = SelectKBest(chi2, k='all')
chi2_selector.fit(X, y)
chi2_scores = chi2_selector.scores_
# Perform Random Forest-based feature importance
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X, y)
rf_importances = rf_model.feature_importances_
# Combine feature importance metrics
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Chi2 Score': chi2_scores,
'RF Importance': rf_importances
})
# Sort features by Random Forest Importance (descending)
feature_importance_sorted = feature_importance.sort_values(by='RF Importance', ascending=False)
# Display sorted feature importance
print("Feature Importance (Ordered by Random Forest Importance):")
print("\n")
print(feature_importance_sorted)
import matplotlib.pyplot as plt
# Sort features and scores by Chi-Square scores (descending)
chi2_sorted = feature_importance.sort_values(by='Chi2 Score', ascending=False)
chi2_features = chi2_sorted['Feature']
chi2_scores_sorted = chi2_sorted['Chi2 Score']
# Sort features and scores by Random Forest importance (descending)
rf_sorted = feature_importance.sort_values(by='RF Importance', ascending=False)
rf_features = rf_sorted['Feature']
rf_importances_sorted = rf_sorted['RF Importance']
# Plot Chi-Square Scores and Random Forest Importances
plt.figure(figsize=(14, 8))
# Chi-Square Scores plot
plt.subplot(1, 2, 1)
plt.barh(chi2_features, chi2_scores_sorted, color='skyblue')
plt.title('Chi-Square Feature Importance')
plt.xlabel('Chi-Square Score')
plt.gca().invert_yaxis() # Ensures highest priority is at the top
# Random Forest Importances plot
plt.subplot(1, 2, 2)
plt.barh(rf_features, rf_importances_sorted, color='lightcoral')
plt.title('Random Forest Feature Importance')
plt.xlabel('Feature Importance')
plt.gca().invert_yaxis() # Ensures highest priority is at the top
plt.tight_layout()
plt.show()
Feature Importance (Ordered by Random Forest Importance):

| Feature | Chi2 Score | RF Importance |
|---|---|---|
| Polyuria | 116.184593 | 0.203702 |
| Polydipsia | 120.785515 | 0.202658 |
| Gender | 38.747637 | 0.104876 |
| Age | 33.971724 | 0.095522 |
| partial paresis | 55.314286 | 0.055078 |
| sudden weight loss | 57.749309 | 0.052379 |
| Alopecia | 24.402793 | 0.044073 |
| Irritability | 35.334127 | 0.039392 |
| Polyphagia | 33.198418 | 0.030942 |
| visual blurring | 18.124571 | 0.030147 |
| Itching | 0.047826 | 0.029924 |
| delayed healing | 0.620188 | 0.028387 |
| muscle stiffness | 4.875000 | 0.026561 |
| Genital thrush | 4.914009 | 0.021007 |
| weakness | 12.724262 | 0.018950 |
| Obesity | 2.250284 | 0.016403 |
Chi-Square (Chi2): Looks at each feature (like Gender or Age) by itself to see if it has a strong, direct link to the outcome (like diabetes). If a feature doesn’t stand out alone, it gets a low score.
Random Forest (RF): Looks at how features work together. Even if Gender or Age don’t seem very important alone, they might play a big role when combined with other features (like sudden weight loss or Polyuria) to make better predictions.
So, Chi2 checks individual importance, while RF focuses on teamwork among the features.
Why Random Forest?
If your goal is statistical analysis and you need a quick, simple check, Chi2 might suffice. But if you’re building a predictive model, Random Forest provides richer insights into how features influence outcomes, especially when features interact or relationships are complex.
By combining both methods, you strike a balance between efficiency (Chi2) and effectiveness (RF). This approach avoids unnecessary complexity while ensuring you keep features that significantly impact the model.
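One way to combine the two rankings is to keep only the features that rank highly under both Chi-Square and Random Forest importance. A sketch on synthetic data (the top-5 cutoff is an arbitrary illustration, not the project's chosen threshold):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic non-negative features (chi2 requires non-negative inputs)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X = np.abs(X)
names = np.array([f"f{i}" for i in range(X.shape[1])])

# Top-5 by Chi-Square score
chi2_scores = SelectKBest(chi2, k="all").fit(X, y).scores_
chi2_top = set(names[np.argsort(chi2_scores)[::-1][:5]])

# Top-5 by Random Forest importance
rf = RandomForestClassifier(random_state=42).fit(X, y)
rf_top = set(names[np.argsort(rf.feature_importances_)[::-1][:5]])

# Keep the intersection: features both methods agree on
selected = sorted(chi2_top & rf_top)
print(selected)
```

Features that survive both filters are robust choices: they matter individually (Chi2) and in combination with others (RF).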
PCA Analysis and Feature Reduction for Diabetes Prediction Model¶
- Dimensionality reduction refers to the process of reducing the number of input variables (features) in a dataset while retaining as much of the original information as possible.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Load the dataset
file_path = 'diabetesInfosys.csv'
pcadata = pd.read_csv(file_path)
data=pcadata
from sklearn.preprocessing import LabelEncoder
# Encode categorical features
encoder = LabelEncoder()
for col in data.columns:
if data[col].dtype == 'object':
data[col] = encoder.fit_transform(data[col])
# Separate features and target
X = data.drop(columns=['class'])
y = data['class']
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
variance_before_pca = X.var(axis=0) # Variance of each feature in the original data
print(f"Variance before PCA (for each feature):\n{variance_before_pca}")
print("-------------------------------------------------------------------")
# Apply PCA to retain 95% of variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
# Display explained variance
print(f"Explained variance by each component: {pca.explained_variance_ratio_}")
print(f"Total components selected: {pca.n_components_}")
print(f"Original shape: {X.shape}, Reduced shape: {X_pca.shape}")
# Plot cumulative explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.title('Cumulative Explained Variance by Principal Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()
# Report how many components were retained; note that each PCA component
# is a combination of all original features, so no single feature is "removed"
original_columns = X.columns  # X is a pandas DataFrame
retained_features_count = pca.n_components_
Variance before PCA (for each feature):
Age                   147.658126
Gender                  0.233348
Polyuria                0.250467
Polydipsia              0.247780
sudden weight loss      0.243631
weakness                0.242978
Polyphagia              0.248522
Genital thrush          0.173648
visual blurring         0.247780
Itching                 0.250300
Irritability            0.183948
delayed healing         0.248848
partial paresis         0.245680
muscle stiffness        0.234827
Alopecia                0.226171
Obesity                 0.140863
dtype: float64
-------------------------------------------------------------------
Explained variance by each component: [0.24421092 0.13922824 0.09026398 0.0753711  0.0602457  0.05201242
 0.04808062 0.04611141 0.04151837 0.03601345 0.0329961  0.03145423
 0.03060287 0.02580352]
Total components selected: 14
Original shape: (520, 16), Reduced shape: (520, 14)
Why Is PCA Unnecessary for This Dataset?
- PCA is not needed here: the dataset has only 16 features, so dimensionality is not a bottleneck, and projecting onto components would sacrifice the interpretability of the original features.
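One way to sanity-check this decision is to compare cross-validated accuracy with and without a PCA step. The sketch below uses synthetic stand-in data with the same shape as the diabetes dataset (520 rows, 16 features); the choice of Logistic Regression and 5-fold CV is illustrative, not the project's final setup.

```python
# Sketch: does a PCA step (95% retained variance) help or hurt
# a simple classifier on a small, 16-feature dataset?
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=520, n_features=16, n_informative=8,
                           random_state=42)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                         LogisticRegression(max_iter=1000))

acc_base = cross_val_score(baseline, X, y, cv=5).mean()
acc_pca = cross_val_score(with_pca, X, y, cv=5).mean()
print(f"Without PCA: {acc_base:.3f}  With PCA: {acc_pca:.3f}")
```

If the PCA pipeline does not clearly outperform the baseline, the extra transformation only costs interpretability.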
Model Building¶
We evaluated six models:¶
- Logistic Regression
- Random Forest
- Gradient Boosting
- Support Vector Classifier (SVC)
- Extra Trees
- Decision Tree
Each model was evaluated on four metrics: Accuracy, Precision, Recall, and F1 Score.¶
- Accuracy: How often the model is correct.
- Precision: How many predicted positives are actually correct.
- Recall: How many actual positives the model correctly identified.
- F1 Score: the harmonic mean of precision and recall, balancing the two in a single number.
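The four metrics above all derive from the confusion-matrix counts (true/false positives and negatives). A minimal worked example, checked against scikit-learn on hand-made labels:

```python
# Sketch: the four metrics computed by hand from confusion-matrix counts,
# verified against scikit-learn on a tiny hand-made example.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # 3

accuracy = (tp + tn) / len(y_true)                       # 6/8 = 0.75
precision = tp / (tp + fp)                               # 3/4 = 0.75
recall = tp / (tp + fn)                                  # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)       # 0.75

assert accuracy == accuracy_score(y_true, y_pred)
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
assert f1 == f1_score(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

For a medical screening task, recall is usually the metric to watch: a false negative (a missed diabetic patient) is costlier than a false positive.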
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import MinMaxScaler
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Scaling the data¶
- Scaling ensures that all features are on the same scale, preventing any feature from dominating the model due to its larger range. It improves model convergence and performance, and ensures each feature contributes fairly.
# Scale data for Logistic Regression and SVC
scaler = MinMaxScaler()
X_train_log_reg = scaler.fit_transform(X_train)
X_test_log_reg = scaler.transform(X_test)
Hyperparameter Tuning¶
- Hyperparameter tuning refers to the process of selecting the best values for the hyperparameters of a machine learning model.
- For hyperparameter tuning in our code, we used GridSearchCV.
- GridSearchCV exhaustively tries every combination of settings in a grid, scores each with cross-validation, and selects the combination that performs best.
# Define the hyperparameter grids for each model
param_grid_rf = {
'n_estimators': [100, 200],
'max_depth': [10, 20, None],
'min_samples_split': [2, 5]
}
param_grid_gb = {
'n_estimators': [100, 200],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 7]
}
param_grid_lr = {
'C': [0.1, 1, 10],
'solver': ['liblinear', 'saga']
}
param_grid_dt = {
'max_depth': [5, 10, 20, None],
'min_samples_split': [2, 5, 10],
'criterion': ['gini', 'entropy']
}
param_grid_svc = {
'C': [0.1, 1, 10],
'kernel': ['linear', 'rbf'],
'gamma': ['scale', 'auto']
}
param_grid_et = {
'n_estimators': [100, 200],
'max_depth': [10, 20, None],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2]
}
# Initialize models
rf = RandomForestClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)
lr = LogisticRegression(random_state=42)
dt = DecisionTreeClassifier(random_state=42)
svc = SVC(random_state=42)
et = ExtraTreesClassifier(random_state=42)
# Initialize GridSearchCV for each model
grid_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_gb = GridSearchCV(estimator=gb, param_grid=param_grid_gb, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_lr = GridSearchCV(estimator=lr, param_grid=param_grid_lr, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_dt = GridSearchCV(estimator=dt, param_grid=param_grid_dt, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_svc = GridSearchCV(estimator=svc, param_grid=param_grid_svc, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_et = GridSearchCV(estimator=et, param_grid=param_grid_et, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
# Fit the grid search to the data
grid_rf.fit(X_train, y_train)
grid_gb.fit(X_train, y_train)
grid_lr.fit(X_train_log_reg, y_train) # Using scaled data for Logistic Regression
grid_dt.fit(X_train, y_train)
grid_svc.fit(X_train_log_reg, y_train) # SVC needs scaled data
grid_et.fit(X_train, y_train)
# Print the best parameters for each model
print("Best parameters for Random Forest:", grid_rf.best_params_)
print("Best parameters for Gradient Boosting:", grid_gb.best_params_)
print("Best parameters for Logistic Regression:", grid_lr.best_params_)
print("Best parameters for Decision Tree:", grid_dt.best_params_)
print("Best parameters for SVC:", grid_svc.best_params_)
print("Best parameters for Extra Trees:", grid_et.best_params_)
# Best models obtained from GridSearchCV
best_rf = grid_rf.best_estimator_
best_gb = grid_gb.best_estimator_
best_lr = grid_lr.best_estimator_
best_dt = grid_dt.best_estimator_
best_svc = grid_svc.best_estimator_
best_et = grid_et.best_estimator_
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 18 candidates, totalling 54 fits
Fitting 3 folds for each of 6 candidates, totalling 18 fits
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best parameters for Random Forest: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 100}
Best parameters for Gradient Boosting: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
Best parameters for Logistic Regression: {'C': 1, 'solver': 'liblinear'}
Best parameters for Decision Tree: {'criterion': 'entropy', 'max_depth': 10, 'min_samples_split': 2}
Best parameters for SVC: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Best parameters for Extra Trees: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Train and Evaluate Each Model¶
# Initialize a list to store results
results_list = []
# Train and evaluate each model
for name, model in {'Random Forest': best_rf, 'Gradient Boosting': best_gb, 'Logistic Regression': best_lr,
                    'Decision Tree': best_dt, 'SVC': best_svc, 'Extra Trees': best_et}.items():
    if name in ('Logistic Regression', 'SVC'):
        # Logistic Regression and SVC require scaled data
        model.fit(X_train_log_reg, y_train)
        y_pred = model.predict(X_test_log_reg)
    else:
        # Tree-based models work with the original (unscaled) data
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

    # Evaluate performance
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Add results to the list
    results_list.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    })
Create a DataFrame with the Results¶
# Step 4: Create a DataFrame with the results
results_df = pd.DataFrame(results_list)
# Display the final results DataFrame
print(results_df)
print("-----------------------------------------------------------------------------------")
top_models = results_df.sort_values(by='Accuracy', ascending=False).head(2)
# Print the top two models with the highest accuracy
print("\nTop models with the highest accuracy:")
print(top_models)
Model Accuracy Precision Recall F1 Score
0 Random Forest 0.990385 1.000000 0.985915 0.992908
1 Gradient Boosting 0.980769 1.000000 0.971831 0.985714
2 Logistic Regression 0.932692 0.944444 0.957746 0.951049
3 Decision Tree 0.980769 1.000000 0.971831 0.985714
4 SVC 0.971154 0.985714 0.971831 0.978723
5 Extra Trees 0.990385 1.000000 0.985915 0.992908
-----------------------------------------------------------------------------------
Top models with the highest accuracy:
Model Accuracy Precision Recall F1 Score
0 Random Forest 0.990385 1.0 0.985915 0.992908
5 Extra Trees 0.990385 1.0 0.985915 0.992908
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='Accuracy', data=results_df, palette='viridis')
# Title and labels
plt.title('Accuracy by Model')
plt.ylabel('Accuracy')
plt.xticks(rotation=45, ha="right")
# Show the plot
plt.tight_layout()
plt.show()
- Although Random Forest and Extra Trees achieve identical accuracy, precision, recall, and F1 scores, Random Forest is preferred.
Why Random Forest Over Extra Trees?
In our diabetes prediction model, we chose Random Forest over Extra Trees for three key reasons:
Better Interpretability:
- Random Forest provides clear insight into the importance of features such as Polyuria and Polydipsia, helping us understand which factors most influence the diabetes prediction. In medical applications, transparency in model decision-making is crucial, as it helps healthcare providers trust the model's results.
Less Randomness in Decisions:
- Extra Trees injects additional randomness by choosing fully randomized split points, which can reduce interpretability. In healthcare, stability and consistency matter more than squeezing out marginal accuracy gains, especially when the model's decisions affect lives.
Stability:
- Random Forest tends to be more stable across different data splits because it combines predictions from many trees trained on bootstrap samples.
Conclusion:
Random Forest offers a balance of performance, interpretability, and stability. This makes it the preferred choice for understanding and explaining the factors influencing diabetes diagnosis, ensuring the model's decisions are trustworthy and reliable in real-world medical settings.
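The interpretability argument can be made concrete by reading feature importances off the fitted forest. The sketch below uses synthetic stand-in data and generic feature names; in the project this would be the fitted `best_rf` and the real column names (e.g. Polyuria, Polydipsia).

```python
# Sketch: ranking features by Random Forest importance, on stand-in data.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=520, n_features=5, n_informative=3,
                           random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1.0; higher means the feature drives more splits
importances = (pd.Series(rf.feature_importances_, index=feature_names)
               .sort_values(ascending=False))
print(importances)
```

A ranking like this is what lets clinicians see which symptoms the model actually leans on, which is the core of the interpretability case made above.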